About Data Source

The United States Census HIV/AIDS Surveillance Data Base contains information for all countries and areas of the world with at least 5,000 population, with the exception of Northern America (including the United States) and U.S. territories. Data included in the Data Base are drawn from medical and scientific journals, professional papers, official statistics, State Department cables, newspapers, magazines, materials from the World Health Organization, conference abstracts, and the Internet. Sources are scanned and reviewed for data.

Study Objective

The primary objective of this study is to elucidate the temporal dynamics and influential factors affecting HIV incidence rates on a global scale, with a particular focus on diverse sub-populations and geographic regions. We aim to achieve this by conducting a detailed meta-analysis that includes time series and regression analyses to identify significant trends, patterns, and associations. Specifically, our analysis leverages ARIMA modeling to investigate time-dependent trends and linear regression to assess the impact of socioeconomic factors, thereby informing targeted public health strategies. The study places a special emphasis on Sub-Saharan Africa, where preliminary findings indicate a general decline in incidence rates among key sub-populations, such as sex workers, since the early 1990s. By dissecting these trends across varying income groups and regions, we seek to prioritize interventions and allocate resources more effectively in the global fight against HIV.

Data Cleaning and Preparation

library(tidyverse) # for data cleaning and visualization (includes dplyr and ggplot2)
library(broom)     # for tidying model outputs
library(forecast)  # for ARIMA models

Because sample size is inconsistently reported (some records include one and some do not), we imputed the median sample size for records missing it, so that they could be retained in the study without unduly influencing the sample-size variable. Since the Data Base covers all countries and areas worldwide with a population of at least 5,000, we used an external country-classification dataset from the World Bank to categorize the Data Base records for analysis.

incidence <- read.csv("surveillance/data/hiv_incidence.csv") |>
  mutate(
    # Replace missing INC_RATE with 0 or other appropriate value
    INC_RATE = replace(INC_RATE, is.na(INC_RATE), 0), 
    # Replace missing SAMPSIZE with median or other appropriate value
    SAMPSIZE = replace(SAMPSIZE, is.na(SAMPSIZE), median(SAMPSIZE, na.rm = TRUE)) 
  )

country_info <- readxl::read_excel("surveillance/CLASS.xlsx", sheet = "List of economies", col_names = TRUE)

incidence_grouped <- left_join(incidence, country_info, by = c("Country.Code" = "Code")) |> 
  janitor::clean_names()

categorize_decade <- function(year) {
  paste0(substr(year, 1, 3), "0s")
}

incidence_grouped <- incidence_grouped |> 
  mutate(decade = sapply(year, categorize_decade)) 
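As a quick sanity check, the helper maps four-digit years to decade labels (the example years below are illustrative, not drawn from the Data Base):

```r
# Quick check of the decade helper on a few illustrative years
categorize_decade(c(1987, 1994, 2005))
# "1980s" "1990s" "2000s"
```

Note that `substr()` and `paste0()` are vectorized, so the helper works on whole vectors directly.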

Trend Analysis

We examined trends in the dataset along two paths.

  • First, we categorized the data by continental region, the studied country’s national income level, and the decade in which the study was conducted, and produced a heatmap of reported HIV incidence rates grouped by these categories.
  • Second, we categorized the data by each study’s reported target population (or population subgroup) and produced a bar chart of the reported number of HIV cases grouped by these categories.
# For trend analysis involving region (continental), income_group (i.e., high-income, low-income, middle-income countries), and decade (1980s, 1990s, 2000s, 2010s, 2020s).
trend_continental <- incidence_grouped |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases)) |>
  # Group and summarize data by region, income_group, and decade
  group_by(region, income_group, decade) |>
  summarize(
    avg_inc_rate = mean(inc_rate, na.rm = TRUE),
    total_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # to drop grouping structure after summarization
  )

# For trend analysis involving population subgroup, e.g., MSM, IVDU, Transgender individuals, etc.
trend_subgroup <- incidence_grouped |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases)) |>
  # Group and summarize data by population subgroup
  group_by(population_subgroup) |>
  summarize(
    avg_inc_rate = mean(inc_rate, na.rm = TRUE),
    total_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # to drop grouping structure after summarization
  )
  1. Heatmap for Incidence Rate by Region, Decade, and National Income Level
trend_continental <- trend_continental |> 
  mutate(avg_inc_rate = as.numeric(avg_inc_rate)) 

heatmap_plot <- trend_continental |> 
  ggplot(aes(x = decade, y = region, fill = avg_inc_rate)) + 
  geom_tile(color = "white") +  
  scale_fill_gradient(low = "white", high = "red") +  
  facet_wrap(~income_group) +  
  theme_minimal() + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    axis.title = element_text(size = 12),  
    legend.title = element_text(size = 10)  
  ) +
  labs(
    title = "Heatmap of HIV Incidence Rate by Region, Decade, and Income Level",
    x = "Decade",
    y = "Region",
    fill = "Avg Inc Rate"
  )

print(heatmap_plot)

  2. Bar Plot for Total Cases by Population Subgroup
trend_subgroup |> 
  ggplot(aes(y = population_subgroup, x = total_cases, fill = population_subgroup)) + 
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Total HIV Cases by Population Subgroup",
       y = "Population Subgroup",
       x = "Total Cases") +
  theme(axis.text.y = element_text(size = 5, angle = 0),
        legend.position = "none")

Meta-Analysis

Analysis Plan Flowchart

Data Preparation for Meta-Analysis

sorted_data <- incidence_grouped |> 
  arrange(year) |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases))

# Transform the dataset to have one observation per time point per subgroup
grouped_data <- sorted_data |> 
  group_by(year, region, income_group, population_subgroup) |> 
  summarize(
    mean_inc_rate = mean(inc_rate, na.rm = TRUE),
    sum_no_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # This ensures the resulting tibble is not grouped
  )

Time-Series Analysis

Time series analysis is appropriate for examining how data points indexed in time order (like yearly HIV incidence rates) change over time. It is suitable for analyzing trends, seasonal patterns, or forecasting future values. Since we wanted to assess HIV incidence rates by year, time series analysis is a good match.

library(dplyr)
library(forecast)
library(tseries)
library(purrr)

# Create a list of unique combinations of region, income_group, and population_subgroup
unique_combinations <- grouped_data |>
  distinct(region, income_group, population_subgroup)

# Function to perform time series analysis on a subset of data
perform_time_series_analysis <- function(subset_data, min_points = 20) {
  if (nrow(subset_data) < min_points) {
    warning(paste("Not enough data points to fit ARIMA model for",
                  unique(subset_data$region), unique(subset_data$income_group),
                  unique(subset_data$population_subgroup)))
    return(NULL)
  }
  
  ts_data <- ts(subset_data$mean_inc_rate, start = min(subset_data$year), frequency = 1)
  
  # Check for stationarity
  adf_test <- adf.test(ts_data, alternative = "stationary")
  
  # Fit ARIMA model
  arima_model <- auto.arima(ts_data)
  
  # Diagnostics
  diagnostics <- list(
    checkresiduals(arima_model, plot = FALSE)
  )
  
  # Return the results
  list(arima_model = arima_model, adf_test = adf_test, diagnostics = diagnostics)
}

# Minimum number of yearly observations required to fit a model;
# adjust this threshold as appropriate for your data
threshold <- 20

# List to store results
time_series_results <- list()

# Loop over each unique combination
for (i in seq_len(nrow(unique_combinations))) {
  # Filter the data for the current combination
  current_data <- grouped_data |>
    filter(
      region == unique_combinations$region[i],
      income_group == unique_combinations$income_group[i],
      population_subgroup == unique_combinations$population_subgroup[i]
    )
  
  # Fit the model if there are enough data points
  result <- perform_time_series_analysis(current_data, min_points = threshold)
  if (!is.null(result)) {
    time_series_results[[paste(unique_combinations$region[i],
                               unique_combinations$income_group[i],
                               unique_combinations$population_subgroup[i],
                               sep = "_")]] <- result
  }
}
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691
## 
## Model df: 2.   Total lags used: 5
# Check how many models were successfully fitted
length(time_series_results)
## [1] 1
# View results if any models were fitted
if (length(time_series_results) > 0) {
  time_series_results
} 
## $`Sub-Saharan Africa_Lower middle income_Sex workers`
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$arima_model
## Series: ts_data 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2    drift
##       -1.5900  0.9355  -1.0622
## s.e.   0.4624  0.5196   0.5354
## 
## sigma^2 = 70.67:  log likelihood = -89.56
## AIC=187.12   AICc=189.12   BIC=192
## 
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$adf_test
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -3.0323, Lag order = 2, p-value = 0.1791
## alternative hypothesis: stationary
## 
## 
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$diagnostics
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$diagnostics[[1]]
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691

Results

ARIMA Model:

  • Model Type: ARIMA(0,1,2) with drift indicates that the model is an ARIMA model with no autoregressive terms (p=0), one differencing (d=1 to make the series stationary), and two moving average terms (q=2).
  • Coefficients:
    • ma1: The first moving average coefficient is -1.5900, with a standard error of 0.4624. The estimate is more than three standard errors from zero, so this coefficient is clearly significant.
    • ma2: The second moving average coefficient is 0.9355, with a standard error of 0.5196. The estimate is less than twice its standard error, so this coefficient is not clearly significant at the 0.05 level.
    • drift: The coefficient for the drift term is -1.0622, with a standard error of 0.5354, roughly twice its standard error. This suggests a downward trend in the series.
  • Model Fit:
    • sigma^2: The estimated variance of the residuals is 70.67.
    • log likelihood: The log-likelihood of the model is -89.56, which is used to calculate AIC and BIC.
    • AIC, AICc, BIC: These are information criteria used to compare models, with lower values generally indicating a better fit. The AICc is an adjusted version of the AIC for small sample sizes.
  • Augmented Dickey-Fuller Test:
    • This test checks for the presence of unit roots, which are a sign of non-stationarity in the time series.
    • The Dickey-Fuller value is -3.0323 with a p-value of 0.1791. Since the p-value is greater than the common significance level of 0.05, the test does not reject the null hypothesis of a unit root being present, suggesting that the time series may not be stationary despite the differencing.
  • Ljung-Box Test:
    • This test checks whether the residuals show significant autocorrelation at the tested lags. Ideally, the p-value should be above 0.05, indicating no significant autocorrelation.
    • The test statistic is Q* = 8.4892 with a p-value of 0.03691. Since the p-value is less than 0.05, there is evidence of significant autocorrelation in the residuals, which means the model may not be adequately capturing all the patterns in the data.
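To make the significance judgments above concrete, we can compute approximate z-statistics (estimate divided by standard error) from the reported coefficients; values beyond roughly ±1.96 indicate significance at the 0.05 level:

```r
# Approximate z-statistics for the reported ARIMA(0,1,2) coefficients
est <- c(ma1 = -1.5900, ma2 = 0.9355, drift = -1.0622)
se  <- c(ma1 =  0.4624, ma2 =  0.5196, drift =  0.5354)
round(est / se, 2)
# ma1 ≈ -3.44, ma2 ≈ 1.80, drift ≈ -1.98
```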

Interpretations of results

  • The ARIMA model suggests a decreasing trend in the incidence rates for the subgroup.
  • The ADF test suggests that the series may not be completely stationary, indicating that further differencing or a different model specification may be necessary.
  • The significant Ljung-Box test implies that there are autocorrelations in the residuals that the model is not capturing, which could mean that important predictors or higher-order lags might be missing from the model. Given the limited information on the Data Base we are using, we are unable to progress further. However, future research may consider incorporating other predictors for the model.
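One way to probe whether further differencing would help is the unit-root-based `ndiffs()` helper from the forecast package. This is a sketch that assumes `ts_data` (the yearly series for the fitted combination, constructed inside `perform_time_series_analysis`) is rebuilt or still in scope:

```r
# Estimate how many differences the series needs (KPSS test by default)
forecast::ndiffs(ts_data)

# The same estimate using an ADF-based test instead
forecast::ndiffs(ts_data, test = "adf")
```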

Time Series Trend Visualization

1. Select which model/combination to visualize

# Loop through the results and print the summary of each ARIMA model
for (combination in names(time_series_results)) {
  cat("\nCombination:", combination, "\n")
  print(summary(time_series_results[[combination]]$arima_model))
}
## 
## Combination: Sub-Saharan Africa_Lower middle income_Sex workers 
## Series: ts_data 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2    drift
##       -1.5900  0.9355  -1.0622
## s.e.   0.4624  0.5196   0.5354
## 
## sigma^2 = 70.67:  log likelihood = -89.56
## AIC=187.12   AICc=189.12   BIC=192
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE    MAPE      MASE      ACF1
## Training set -0.6193829 7.732668 5.426826 1.645181 139.093 0.7643619 0.4777973

Based on the summary of the ARIMA model for the combination “Sub-Saharan Africa_Lower middle income_Sex workers,” it looks like an interesting case to visualize. The ARIMA(0,1,2) model with drift indicates that the time series is differenced once (to make it stationary) and includes two moving average terms. The drift term suggests a linear trend over time.

2. Time Series Trend Visualization

# Extracting the data and model for "Sub-Saharan Africa_Lower middle income_Sex workers"
selected_model <- time_series_results[["Sub-Saharan Africa_Lower middle income_Sex workers"]]$arima_model
selected_data <- grouped_data %>%
  filter(
    region == "Sub-Saharan Africa",
    income_group == "Lower middle income",
    population_subgroup == "Sex workers"
  )
# Plotting the actual time series data and the fitted model
ggplot(selected_data, aes(x = year, y = mean_inc_rate)) +
  geom_line() +
  geom_line(aes(y = fitted(selected_model)), color = "red") +
  labs(title = "Time Series Trend and ARIMA Model for Sub-Saharan Africa, Lower Middle Income, Sex Workers",
       x = "Year",
       y = "Mean Incidence Rate") +
  theme_minimal()

  • Black Line (Actual Data): Represents the actual observed values of the mean incidence rate over time. This line shows the fluctuations in the data across years.

  • Red Line (Fitted ARIMA Model): Represents the values as fitted by the ARIMA model. The red line seems to follow the general downward trend of the actual data, indicating that the model is capturing the overall trend well.

  • Interpretation of Trend: There is a notable decrease in the mean incidence rate from the early 1990s, which then stabilizes into a more fluctuating pattern from the 2000s onwards. The red line (fitted values) is capturing this trend, suggesting that the ARIMA model is accounting for the major trend in the data.

3. Diagnostic Plots for the ARIMA Model

# Diagnostic plots for the ARIMA model of the selected combination
checkresiduals(selected_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691
## 
## Model df: 2.   Total lags used: 5
  • Top Plot (Residuals vs. Time): This plot shows the residuals of the model over time. Ideally, the residuals should be randomly scattered around zero, indicating that the model has captured most of the trend and seasonality. In our plot, the residuals show no clear pattern over time, which is good, but certain years show larger deviations from zero, which could indicate model misspecification or the presence of outliers/anomalies.

  • Bottom Left Plot (ACF of Residuals): This plot shows the autocorrelations of the residuals at different lags. Ideally, we want to see these autocorrelations fall within the blue dashed lines, which represent the confidence intervals. If they are within the blue lines, it suggests that there is no significant autocorrelation in the residuals, and the model has captured the time series’ structure well. Our plot shows some autocorrelations at different lags crossing the confidence bounds, which might suggest that the model could be improved to better capture the time series dynamics.

  • Bottom Right Plot (Histogram and Q-Q Plot of Residuals): The histogram with the overlaid density plot and the Q-Q plot assess the normality of the residuals. The histogram seems to show that residuals may be normally distributed, but the Q-Q plot indicates some deviations from normality, particularly in the tails. This suggests that the residuals have some heavier tails than the normal distribution, which can occur if there are outliers or extreme values in our data.
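The visual normality assessment can be supplemented with a formal test. This is a sketch using the `selected_model` object fitted above; a small p-value would confirm the departure from normality suggested by the Q-Q plot:

```r
# Shapiro-Wilk test for normality of the ARIMA residuals
shapiro.test(residuals(selected_model))
```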

Linear Regression Analysis

Regression analysis is used to understand the relationship between a dependent variable and one or more independent variables. It is suitable for questions such as how various factors (region, income level, population subgroup) are associated with an outcome (e.g., HIV incidence rate). Note, however, that linear regression carries the usual regression assumptions and, in addition, assumes that the residuals are normally distributed.

library(dplyr)
library(broom)

# Create a list of unique combinations of region and income_group
combinations_region_income <- grouped_data |>
  distinct(region, income_group)

# Create a list of unique population_subgroup
combinations_subpopulation <- grouped_data |>
  distinct(population_subgroup)

# Function to perform regression analysis on a subset of data
perform_regression_analysis <- function(subset_data) {
  if (nrow(subset_data) < 10) {  # Setting a threshold for the minimum number of data points
    warning("Not enough data points to fit regression model for the given combination.")
    return(NULL)
  }
  
  # Fit linear regression model
  reg_model <- lm(mean_inc_rate ~ year, data = subset_data)
  
  # Collect the summary statistics of the model
  reg_summary <- summary(reg_model)
  
  # Collect tidy results of the model for easy handling
  tidy_results <- tidy(reg_model)
  
  # Return the results
  return(list(reg_model = reg_model, reg_summary = reg_summary, tidy_results = tidy_results))
}

# List to store results for region and income_group
regression_region_income_results <- list()

# Loop over each region & income level combination
for(i in seq_len(nrow(combinations_region_income))) {
  # Filter the data for the current combination
  current_data <- grouped_data |>
    filter(
      region == combinations_region_income$region[i],
      income_group == combinations_region_income$income_group[i]
    )
  
  # Perform regression analysis
  result <- perform_regression_analysis(current_data)
  
  # Store the result with a unique name if result is not NULL
  if (!is.null(result)) {
    regression_region_income_results[[paste(combinations_region_income$region[i], combinations_region_income$income_group[i], sep = "_")]] <- result
  }
}

# List to store results for population_subgroup
regression_subpopulation_results <- list()

# Loop over each population_subgroup
for(i in seq_len(nrow(combinations_subpopulation))) {
  # Filter the data for the current population_subgroup
  current_data <- grouped_data |>
    filter(
      population_subgroup == combinations_subpopulation$population_subgroup[i]
    )
  
  # Perform regression analysis
  result <- perform_regression_analysis(current_data)
  
  # Store the result with a unique name if result is not NULL
  if (!is.null(result)) {
    regression_subpopulation_results[[combinations_subpopulation$population_subgroup[i]]] <- result
  }
}

# Check the first element of regression_region_income_results to make sure it contains a valid model
if(length(regression_region_income_results) > 0) {
  print(regression_region_income_results[[1]])
}
## $reg_model
## 
## Call:
## lm(formula = mean_inc_rate ~ year, data = subset_data)
## 
## Coefficients:
## (Intercept)         year  
##    0.230660     0.001763  
## 
## 
## $reg_summary
## 
## Call:
## lm(formula = mean_inc_rate ~ year, data = subset_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7835 -3.3772 -0.2461  1.2854 10.3973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.307e-01  2.219e+02   0.001    0.999
## year        1.763e-03  1.103e-01   0.016    0.987
## 
## Residual standard error: 4.144 on 16 degrees of freedom
## Multiple R-squared:  1.596e-05,  Adjusted R-squared:  -0.06248 
## F-statistic: 0.0002554 on 1 and 16 DF,  p-value: 0.9874
## 
## 
## $tidy_results
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.231     222.      0.00104   0.999
## 2 year         0.00176     0.110   0.0160    0.987
# Check the first element of regression_subpopulation_results to make sure it contains a valid model
if(length(regression_subpopulation_results) > 0) {
  print(regression_subpopulation_results[[1]])
}
## $reg_model
## 
## Call:
## lm(formula = mean_inc_rate ~ year, data = subset_data)
## 
## Coefficients:
## (Intercept)         year  
##    57.78495     -0.02773  
## 
## 
## $reg_summary
## 
## Call:
## lm(formula = mean_inc_rate ~ year, data = subset_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.436 -1.851 -1.070  1.956  4.906 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  57.78495  106.35869   0.543    0.591
## year         -0.02773    0.05298  -0.523    0.605
## 
## Residual standard error: 2.165 on 29 degrees of freedom
## Multiple R-squared:  0.009357,   Adjusted R-squared:  -0.0248 
## F-statistic: 0.2739 on 1 and 29 DF,  p-value: 0.6047
## 
## 
## $tidy_results
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  57.8     106.         0.543   0.591
## 2 year         -0.0277    0.0530    -0.523   0.605

Regression Model Diagnostics

Based on the results, we performed the following model diagnostics:

  • Check for Linearity: Ensure that the relationship between the predictor (year) and the response variable (mean_inc_rate) is linear.
  • Investigate Outliers: Large residuals may indicate outliers, which can unduly influence the regression model. It might be worth investigating these data points to see if they should be included in the analysis.
  • Consider Model Complexity: The model may be too simple. Perhaps the relationship between year and incidence rate is not linear, or there may be other variables or transformations that could improve the model.
# Function to plot diagnostics for a regression model
plot_regression_diagnostics <- function(reg_model) {
  par(mfrow = c(2, 2))
  plot(reg_model)
}

# Perform diagnostics on the first element of regression_region_income_results
if(length(regression_region_income_results) > 0) {
  plot_regression_diagnostics(regression_region_income_results[[1]]$reg_model)
}

# Perform diagnostics on the first element of regression_subpopulation_results
if(length(regression_subpopulation_results) > 0) {
  plot_regression_diagnostics(regression_subpopulation_results[[1]]$reg_model)
}

For both sets of plots, the Residuals vs Fitted and Scale-Location plots look acceptable, though there is some indication of heteroscedasticity in the second set. The Q-Q plots suggest that the residuals may not be perfectly normally distributed, especially for the second model. The Residuals vs Leverage plots indicate some influential points in the first model, particularly observation 70. All of this warrants caution in our final interpretation and conclusions.

Results

  • Model Coefficients: The estimated coefficients for the year variable in both models are quite small, indicating that time may not have a strong influence on the mean_inc_rate for these particular subsets of data.
  • Statistical Significance: The p-values associated with the year coefficient are very high (0.987 for the region-income model and 0.605 for the subpopulation model), which means that the year variable is not statistically significant in predicting the mean_inc_rate in these models.
  • Model Fit: The R-squared values are very close to zero, which suggests that the models explain almost none of the variability in the response data around its mean.
  • Residuals: The residuals’ summary shows that there might be some large residuals in the data, which could be potential outliers or indicate model misspecification.
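To see whether this pattern holds across all fitted subgroup models, the tidy outputs can be collected into one table. This is a sketch using the `regression_subpopulation_results` list built above:

```r
# Gather the year-slope estimate and p-value from every subgroup model
year_effects <- purrr::map_dfr(
  regression_subpopulation_results,
  ~ dplyr::filter(.x$tidy_results, term == "year"),
  .id = "subgroup"
)
year_effects
```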

Sensitivity Analysis

Sensitivity analysis in the context of regression analysis is used to determine how sensitive your results are to changes in the model. This involves testing how the model’s predictions vary with changes in the modeling approach, such as using different subsets of data, handling outliers, or changing the model specification.

1. Varying Model Specification

# Original model
original_model <- lm(mean_inc_rate ~ year + region + income_group, data = grouped_data)

# Model excluding a potentially influential variable
model_without_variable <- lm(mean_inc_rate ~ year + region, data = grouped_data)

# Compare the results
summary(original_model)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region + income_group, data = grouped_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.915 -3.420 -1.538  1.574 49.944 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      127.41820   62.21955   2.048  0.04094 * 
## year                              -0.06142    0.03099  -1.982  0.04783 * 
## regionEurope & Central Asia        5.17868    1.72935   2.995  0.00284 **
## regionLatin America & Caribbean   -0.63691    0.77852  -0.818  0.41357   
## regionMiddle East & North Africa  -2.46132    1.66594  -1.477  0.14000   
## regionSouth Asia                   1.18550    1.40394   0.844  0.39873   
## regionSub-Saharan Africa           1.05570    0.68748   1.536  0.12508   
## income_groupLow income            -1.41851    1.32838  -1.068  0.28595   
## income_groupLower middle income    0.50643    1.24616   0.406  0.68458   
## income_groupUpper middle income    0.11720    1.09376   0.107  0.91470   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.841 on 708 degrees of freedom
## Multiple R-squared:  0.03775,    Adjusted R-squared:  0.02552 
## F-statistic: 3.086 on 9 and 708 DF,  p-value: 0.001211
summary(model_without_variable)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region, data = grouped_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.733 -3.491 -1.546  1.413 50.530 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      97.36613   60.70513   1.604  0.10918   
## year                             -0.04641    0.03019  -1.537  0.12473   
## regionEurope & Central Asia       5.16548    1.73263   2.981  0.00297 **
## regionLatin America & Caribbean  -0.62219    0.77997  -0.798  0.42531   
## regionMiddle East & North Africa -2.23179    1.56566  -1.425  0.15446   
## regionSouth Asia                  1.53599    1.25906   1.220  0.22289   
## regionSub-Saharan Africa          0.71239    0.49622   1.436  0.15155   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.861 on 711 degrees of freedom
## Multiple R-squared:  0.02728,    Adjusted R-squared:  0.01907 
## F-statistic: 3.323 on 6 and 711 DF,  p-value: 0.003106

Original Model vs. Model Without income_group:

  • The presence or absence of income_group as a predictor does not drastically change the R-squared value, indicating that while income_group contributes some information, it is not a major driver in the model.
  • The year coefficient shrinks from -0.061 to -0.046 and loses significance at the 0.05 level when income_group is dropped, while the region coefficients remain relatively stable, suggesting that the regional effects are largely independent of income_group.
  • The significance levels (Pr(>|t|)) for most variables remain similar, indicating that the overall statistical relationships in the model are robust to the inclusion or exclusion of income_group.
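A formal complement to this side-by-side comparison is a partial F-test of the nested models, using the two model objects fitted above:

```r
# Partial F-test: does income_group add explanatory power beyond year + region?
anova(model_without_variable, original_model)
```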

2. Handling Outliers

# Identify outliers (example using Cook's distance)
cooksd <- cooks.distance(original_model)
outlier_threshold <- 4 / length(cooksd)
outliers <- which(cooksd > outlier_threshold)

# Model without outliers
model_without_outliers <- lm(mean_inc_rate ~ year + region + income_group, data = grouped_data[-outliers,])

# Compare the results
summary(original_model)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region + income_group, data = grouped_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.915 -3.420 -1.538  1.574 49.944 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      127.41820   62.21955   2.048  0.04094 * 
## year                              -0.06142    0.03099  -1.982  0.04783 * 
## regionEurope & Central Asia        5.17868    1.72935   2.995  0.00284 **
## regionLatin America & Caribbean   -0.63691    0.77852  -0.818  0.41357   
## regionMiddle East & North Africa  -2.46132    1.66594  -1.477  0.14000   
## regionSouth Asia                   1.18550    1.40394   0.844  0.39873   
## regionSub-Saharan Africa           1.05570    0.68748   1.536  0.12508   
## income_groupLow income            -1.41851    1.32838  -1.068  0.28595   
## income_groupLower middle income    0.50643    1.24616   0.406  0.68458   
## income_groupUpper middle income    0.11720    1.09376   0.107  0.91470   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.841 on 708 degrees of freedom
## Multiple R-squared:  0.03775,    Adjusted R-squared:  0.02552 
## F-statistic: 3.086 on 9 and 708 DF,  p-value: 0.001211
summary(model_without_outliers)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region + income_group, data = grouped_data[-outliers, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5939 -2.6421 -0.9381  1.7817 13.7103 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      49.66804   38.26486   1.298  0.19473   
## year                             -0.02355    0.01905  -1.236  0.21675   
## regionEurope & Central Asia       4.19776    1.97847   2.122  0.03423 * 
## regionLatin America & Caribbean  -0.16126    0.46045  -0.350  0.72629   
## regionMiddle East & North Africa -3.03315    1.01515  -2.988  0.00291 **
## regionSouth Asia                 -0.02764    0.87981  -0.031  0.97494   
## regionSub-Saharan Africa          0.62400    0.40962   1.523  0.12814   
## income_groupLow income            0.10593    0.82565   0.128  0.89796   
## income_groupLower middle income   1.37865    0.77953   1.769  0.07742 . 
## income_groupUpper middle income   1.12057    0.68721   1.631  0.10344   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.403 on 671 degrees of freedom
## Multiple R-squared:  0.04719,    Adjusted R-squared:  0.03441 
## F-statistic: 3.692 on 9 and 671 DF,  p-value: 0.0001565

Original Model vs. Model Without Outliers:

  • Removing the outliers noticeably changes several coefficients and their standard errors, indicating that the outliers had some influence on the model estimates.
  • The R-squared value improves slightly in the model without outliers, which could indicate a better fit.
  • The residual standard error drops substantially (from 5.841 to 3.403) when outliers are removed, suggesting that these points contributed much of the error in the original model.
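As a compact way to inspect these differences, the coefficient tables of the two fits can be placed side by side with broom::tidy(). This is a sketch that assumes the original_model and model_without_outliers objects fit above.

```r
# Side-by-side coefficients: full data vs. outliers removed
library(broom)
library(dplyr)

comparison <- full_join(
  tidy(original_model) %>%
    select(term, est_full = estimate, se_full = std.error),
  tidy(model_without_outliers) %>%
    select(term, est_trim = estimate, se_trim = std.error),
  by = "term"
)
comparison  # one row per term, estimates and SEs from both models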

3. Using Different Data Subsets

# Splitting the data into two subsets
set.seed(123)  # for reproducibility
data_subset1 <- grouped_data[sample(nrow(grouped_data), nrow(grouped_data) / 2), ]
data_subset2 <- grouped_data[!rownames(grouped_data) %in% rownames(data_subset1), ]

# Fit the model to each subset
model_subset1 <- lm(mean_inc_rate ~ year + region + income_group, data = data_subset1)
model_subset2 <- lm(mean_inc_rate ~ year + region + income_group, data = data_subset2)

# Compare the results
summary(model_subset1)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region + income_group, data = data_subset1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.208 -3.174 -1.463  1.681 35.677 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                      121.27485   82.14200   1.476    0.141  
## year                              -0.05851    0.04087  -1.431    0.153  
## regionEurope & Central Asia        2.24783    2.40264   0.936    0.350  
## regionLatin America & Caribbean   -1.59627    0.93883  -1.700    0.090 .
## regionMiddle East & North Africa  -3.09265    2.22111  -1.392    0.165  
## regionSouth Asia                  -0.11015    2.07443  -0.053    0.958  
## regionSub-Saharan Africa           1.41334    0.87997   1.606    0.109  
## income_groupLow income            -1.97526    1.79582  -1.100    0.272  
## income_groupLower middle income    0.41310    1.71055   0.242    0.809  
## income_groupUpper middle income    0.65587    1.47201   0.446    0.656  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.166 on 349 degrees of freedom
## Multiple R-squared:  0.04917,    Adjusted R-squared:  0.02465 
## F-statistic: 2.005 on 9 and 349 DF,  p-value: 0.03793
summary(model_subset2)
## 
## Call:
## lm(formula = mean_inc_rate ~ year + region + income_group, data = data_subset2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.892  -3.212  -1.157   1.350  48.761 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      -390.27351  192.70970  -2.025  0.04361 * 
## year                                0.19479    0.09565   2.036  0.04247 * 
## regionEurope & Central Asia         6.41219    2.23264   2.872  0.00433 **
## regionLatin America & Caribbean     0.42624    1.13003   0.377  0.70626   
## regionMiddle East & North Africa   -3.87273    1.75045  -2.212  0.02758 * 
## regionSouth Asia                    0.67433    1.99230   0.338  0.73521   
## regionSub-Saharan Africa            1.63880    0.80680   2.031  0.04299 * 
## income_groupLow income              0.98936    1.94090   0.510  0.61056   
## income_groupLower middle income     1.51186    1.79845   0.841  0.40112   
## income_groupUpper middle income     1.41738    1.61209   0.879  0.37989   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.35 on 349 degrees of freedom
## Multiple R-squared:  0.07208,    Adjusted R-squared:  0.04815 
## F-statistic: 3.012 on 9 and 349 DF,  p-value: 0.001779

Model on Subset 1 vs. Model on Subset 2:

  • The coefficients and their significance differ between the two subsets; notably, the year coefficient is negative in subset 1 but positive and statistically significant in subset 2, pointing to variability in the data or the influence of specific observations.
  • The R-squared values also differ between the subsets, suggesting that how well the model explains variance in the dependent variable depends on the sample drawn.
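A single random split can be unrepresentative; k-fold cross-validation gives a more stable view of out-of-sample error. This sketch uses 5 folds with base R only and assumes the grouped_data columns used above (it may error if a rare factor level lands entirely in one fold).

```r
# 5-fold cross-validation of out-of-sample RMSE
set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(grouped_data)))

rmse <- sapply(1:5, function(k) {
  fit <- lm(mean_inc_rate ~ year + region + income_group,
            data = grouped_data[folds != k, ])
  pred <- predict(fit, newdata = grouped_data[folds == k, ])
  sqrt(mean((grouped_data$mean_inc_rate[folds == k] - pred)^2))
})

rmse        # per-fold RMSE
mean(rmse)  # average out-of-sample error
```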

4. Bootstrap Resampling

# Bootstrap resampling
library(boot)

# Function to obtain coefficients
boot_function <- function(data, indices) {
  sample_data <- data[indices, ]
  model <- lm(mean_inc_rate ~ year + region + income_group, data = sample_data)
  return(coef(model))
}

# Running bootstrap
bootstrap_results <- boot(grouped_data, boot_function, R = 1000)

# View summary of bootstrap results
boot.ci(bootstrap_results, type = "bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   (-46.1, 296.6 )  
## Calculations and Intervals on Original Scale

Bootstrap Confidence Intervals:

  • The reported BCa interval ((-46.1, 296.6)) is wide, indicating considerable uncertainty in the estimate. Note that boot.ci() reports only the first statistic by default (index = 1), so this interval corresponds to the intercept; intervals for the other coefficients require separate calls with the index argument.
  • Wide bootstrap confidence intervals suggest variability in the model’s estimates and call for caution in interpreting the model’s predictive power.
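To attach an interval to every coefficient rather than just the first, boot.ci() can be looped over the index argument. This sketch assumes the bootstrap_results object from the boot() call above and the original_model fit earlier.

```r
# BCa interval for each coefficient in the bootstrap output
coef_names <- names(coef(original_model))
for (i in seq_along(coef_names)) {
  ci <- boot.ci(bootstrap_results, type = "bca", index = i)
  # columns 4 and 5 of the bca component hold the interval endpoints
  cat(coef_names[i], ":",
      round(ci$bca[4], 3), "to", round(ci$bca[5], 3), "\n")
}
```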

5. Interpreting Sensitivity Analysis

General Observations

  • Our model shows some sensitivity to the inclusion of certain predictors and the presence of outliers. This suggests the need for careful consideration of model specifications and data quality.
  • The differences in model outcomes between data subsets imply that the relationships in our data are not entirely consistent across different samples. This could be due to underlying variability in the data or distinct patterns within subgroups.

Recommendations for Future Research

  • Consider whether any additional variables, not included in the model, might explain some of the variability and improve the model’s consistency.
  • Explore if different modeling approaches, such as segmented regression or hierarchical models, might better account for the variability in the data.
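As a sketch of the hierarchical approach suggested above, a random intercept for region lets baseline incidence vary across regions while keeping fixed effects for year and income group. This assumes the lme4 package is installed and uses the grouped_data columns from the models above.

```r
# Hierarchical (mixed-effects) alternative to the pooled linear model
library(lme4)

mixed_model <- lmer(mean_inc_rate ~ year + income_group + (1 | region),
                    data = grouped_data)
summary(mixed_model)
```

Treating region as a random effect pools information across regions and can stabilize estimates when some regions contribute few observations.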

Conclusion

The study has identified meaningful patterns in the incidence rates of HIV among various subgroups in Sub-Saharan Africa. The time series analysis, particularly the ARIMA(0,1,2) model with drift, captured a notable decreasing trend in HIV incidence rates since the early 1990s. This trend stabilizes somewhat in the 2000s, with some fluctuation, but shows no significant increase, suggesting that interventions and awareness efforts may have had an effect over time. Diagnostic plots suggest that while the model fits well overall, there are indications of outliers or anomalous events affecting the residuals, pointing to areas where the model could be refined.

The linear regression analysis further enhances our understanding by evaluating the impact of various factors such as year, region, and income group on HIV incidence rates. Sensitivity analysis indicates that the model’s findings are relatively robust, although some sensitivity to model specifications and outliers was noted. This suggests that while the model provides a solid baseline understanding, it could benefit from further refinement.

The overall conclusion is that HIV incidence rates in Sub-Saharan Africa show a decreasing trend over the years among sex workers, with some variability that might be explained by regional and economic factors. The analysis underscores the importance of tailored public health interventions and the need for continuous monitoring and adaptation to sustain and enhance the progress made in HIV prevention and treatment.